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The partial least squares procedure was originally developed to 
estimate the slope parameter in multivariate parametric models. More 
recently it has gained popularity in the functional data literature. 
There, the partial least squares estimator of slope is either used to 
construct linear predictive models, or as a tool to project the data 
onto a one-dimensional quantity that is employed for further statis- 
tical analysis. Although the partial least squares approach is often 
viewed as an attractive alternative to projections onto the principal 
component basis, its properties are less well known than those of the 
latter, mainly because of its iterative nature. We develop an explicit 
formulation of partial least squares for functional data, which leads 
to insightful results and motivates new theory, demonstrating consis- 
tency and establishing convergence rates. 

1. Introduction. Partial least squares (PLS) is an iterative procedure for 
estimating the slope of linear models. The technique was originally devel- 
oped in high-dimensional and collinear multivariate settings and is especially 
popular in chemometrics. See Wold (1975), Martens and Naes (1989), Hel- 
land (1990), Frank and Friedman (1993), Garthwaite (1994), Goutis and 
Fearn (1996), Durand and Sabatier (1997) and Nguyen and Rocke (2004). 

The iterative nature of PLS can make it difficult to uncover properties in 
a clear and explicit way, and for a long time PLS was regarded as a tech- 
nique that worked well, but whose properties were relatively obscure. Early 
theoretical developments of multivariate PLS can be found in Lorber, Wan- 
gen and Kowalski (1987) and Hoskuldsson (1988), and further developments 
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include those of Phatak, Reilly and Penlidis (2002), Phatak and de Hoog 
(2003), Bro and Elden (2009) and Kramer and Sugiyama (2011). 

More recently, the method has been applied in the functional data context 
by Preda and Saporta (2005a), who suggest using PLS for estimating slope 
in functional linear models; see also Reiss and Ogden (2007). Also in the 
functional setting, the intrinsic iterative nature of PLS has made it difficult 
to develop intuition and derive clear and explicit theoretical properties. In 
this paper we provide a transparent account of theoretical issues that under- 
pin PLS methods in linear models for prediction from functional data, and 
show that they motivate an alternative formulation of PLS in that setting. 
This "alternative PLS," which we refer to here as APLS, has the advantage 
that it is expressed only in terms of functions that are explicitly computable. 
These attributes make APLS particularly attractive, relative to the conven- 
tional PLS formulation, and permit detailed theoretical development. 

We give concise stochastic expansions for the difference between estima- 
tors derived using APLS, and the quantities to which these estimators con- 
verge in the large-sample limit. These expansions are valid uniformly in 
estimators based on the first O^ 1 / 2 ) APLS basis functions, where n de- 
notes sample size. The expansions also lead easily and directly to a variety 
of results about our estimators, including convergence rates and central limit 
theorems. 

Besides functional linear models, PLS is employed in a variety of other 
data functional problems. For example, Ferraty and Vieu (2006) use it to 
define a semi-metric for nonparametric functional predictors or classifiers; 
Escabias, Aguilera and Valderrama (2007) employ PLS with logit regression; 
Preda, Saporta and Leveder (2007) and Delaigle and Hall (2012) use it for 
functional data classification. See also Preda and Saporta (2005b), Kramer, 
Boulesteix and Tutz (2008) and Aguilera et al. (2010). 

2. Functional linear models. 

2.1. General bases for inference in functional linear models. Let X = 
{{X\ , Y\), . . . , (X n ,Y n )} denote a sample of independent data pairs, all dis- 
tributed as (X,Y), where X is a random function defined on the nondegen- 
erate, compact interval I and satisfying J X E(X 2 ) < oo, and Y is a scalar 
random variable generated by the linear model 



Here, a denotes a scalar parameter, e is a scalar random variable with finite 
mean square and satisfying E(e \ X) = and b, a function-valued parameter, 
is a square-integrable function on I. 



(2.1) 
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Predicting the value of Y, given X, amounts to estimating the function 
(2.2) g(x) = E(Y \X = x) = a + J bx, 

which, itself, requires us to estimate the scalar a and the function b from the 
data. A standard approach is to express X and b in terms of an orthonormal 
basis ipi , ip2 1 • • • defined on X. Expansions for X and b in this basis can be 
written as X = YljiJxXiftj^j an d b = Ylj v ji J ji where Vj = fjbipj. Since, 
in practice, we can calculate only a finite number of terms, the infinite- 
dimensional expansion for b is approximated by a sum of p terms, where p > 1 
is an integer, and each term of this sum is then estimated from the data. Note 
that J x bX = ^2 ■ Vj j x Xipj , which motivates us to take a = E{Y) — j x bE{X) 
and define f3\ , . . . , f3 p to be the sequence vi, . . . ,v p that minimizes 

2 



(2.3) s p {v u ...,v p ) = e\J^{X- EX) - £ Vj JjX - EX)^ 



The functions 

v 
i=i 

g p (x) = E(Y) + / b p (x - EX) = E(Y) + V Pj [ (x - EX)^ 
Jz j=1 Jx 

are approximations to b and to g(x), respectively. Their accuracy, as p in- 
creases, depends on the choice of the sequence ipi,ijj2, 

Sometimes the basis is chosen independently of the data (e.g., sine-cosine 
basis, spline basis, etc). Then the functions ifij are known, and an empirical 
version of (2.4) is obtained by replacing the scalars j3\, . . . ,/3 p by the sequence 
vi, . . . ,v p that minimizes 

(2.5) n-^S^Yi-Y-J^VjJ^Xi-X)^ . 

A drawback of such bases is that there is no reason why their first p ele- 
ments should capture the most important information about the regression 
function g, available from the data. It seems more attractive to use bases 
that adapt to the properties of the population represented by the data. We 
discuss two such adaptive bases in Sections 2.2 and 2.3, respectively. 

2.2. Principal component basis. One of the most popular adaptive bases 
is the so-called principal component basis, constructed from the covariance 
function K(s,t) = cov{X(s),X(t)} of the random process X. As is com- 
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mon in mathematical analysis, we shall use the notation K also for the lin- 
ear transformation (a functional) that takes a square-integrable function ip 
to K(ip) given by K(ip)(t) = J T ip(s)K(s,t) ds. 

Since f x E(X 2 ) < oo, then J x K(t, t) dt < oo, and we can write the spectral 
decomposition of K as 

oo 

(2.6) K{s,t) = Y J h^k{s)^ k {t), 

k=i 

where the principal component basis (fi\,(j>2,--- is a complete orthonormal se- 
quence of eigenvectors (i.e., eigenfunctions) of the transformation K, with re- 
spective nonnegative eigenvalues 61,62, That is, K((j>k) = 9k4>k for k > 1. 

Positive definiteness of K implies that the eigenvalues are nonnegative, and 
the condition f x E(X 2 ) < 00 entails Ylk@k < 00 • Therefore we can, and do, 
order the terms in the series in (2.6) so that 

(2.7) 6 1 >6 2 > ■■■>(). 

In practice the scalars 9j and the functions cpj are unknown and are esti- 
mated from the data, as follows. First, the covariance function is estimated 
by 

l n 

(2.8) K(s,t) = -Y,{Ms)-X(s)}{X t (t)-X(t)}, 

i=l 

where X{t) = re -1 Ya=i Xi(t). Then, 6±, . . . ,6 n and (pi, . . . , <p n are estimated 

by the eigenvalues Q\ > 62 > • • • 6 n > and the eigenfunctions 4>i , . . . , <f> n of 
the transformation represented by K, which can have at most n nonzero 
eigenvalues. Finally, an empirical version of j3\, . . . , f3 p is defined to be the se- 
quence v\, . . . ,v p that minimizes (2.5), where each ipj there is replaced by <j>j. 
Then, g p at (2.4) is replaced by its corresponding empirical version. In the 
rest of this paper, to avoid confusion with projections of b onto other bases, 
we shall add a superscript PC to coefficients obtained from projection of b 
onto one of the functions <f>j\ that is, we shall use the notation j3j c = f x b(j)j. 

The literature on functional linear models based on principal component 
analysis (PCA) is large. It includes, for example, work by Cai and Hall 
(2006), Reiss and Ogden (2007), Apanasovich and Goldstein (2008), Cardot 
and Sarda (2008), Baillo (2009), Miiller and Yao (2010), Wu, Fan and Miiller 
(2010) and Yao and Miiller (2010). 

2.3. The orthonormal PLS basis. The principal component basis intro- 
duced in Section 2.2 is defined in terms of the population, but only through X . 
In particular, while its first p elements <pi, . . . ,(j> p usually contain most of 
the information related to the covariance of X, these are not necessarily 
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important for representing b, and all or some of the most important terms 
accounting for the interaction between b and X might come from later prin- 
cipal components. In prediction, to capture the main effects of interaction 
using only a few terms, one could construct the basis in a way that takes 
this interaction into account. 

Motivated by such considerations, the standard PLS basis, adapted to 
the functional context, is defined iteratively by choosing vjj p in a sequential 
manner, to maximize the covariance functional 



(2.9) 

subject to 



(2.10) 



fp{i> P ) = cov|y-5p_i(x), J^xi/j p X, 



i/)j(s)K(s,t)ip p (t) dsdt = forl<j'<p— 1 and 



it J x 

1, 



lirpi 

where || • || is a norm (see Section 3.1), and given that ipi, . . . ,ip p -i have 
already been chosen. [Recall that g p was defined at (2.4).] In practice, the 
covariances in (2.9) are replaced by estimates, and empirical versions of 
the ipj's are constructed by an iterative algorithm described in Appendix A. 2. 

Partial least squares can also be used for prediction in nonlinear models, 
where the basis that it produces is sometimes, but not always, effective for 
prediction. Specifically, although the PLS basis enables a consistent approx- 
imation to g in such large number of terms may be required to get 
a good approximation. 

3. Properties of theoretical functional partial least squares. For predic- 
tion and estimation of 6, the PLS basis is sometimes preferred to the PCA 
basis, partly because it can often capture the relevant information with fewer 
terms; see our data illustrations in Section 6. Detailed theoretical properties 
for inference in functional linear models based on the PCA basis have been 
studied in a number of papers, but few results exist about their functional 
PLS counterpart. In this section we provide new insight into the theoreti- 
cal PLS basis, defined in (2.9) and (2.10), and give an explicit description 
of the space generated by the first p PLS basis functions tpi, . . . ,ip p . These 
properties motivate an alternative formulation of functional PLS, which we 
call APLS. It permits us to define the functional PLS basis very simply, and 
to construct an explicitly defined algorithm to implement empirical PLS; 
see Section 4. The explicit nature of the algorithm will allow us to derive 
detailed theoretical properties of empirical functional PLS, including con- 
vergence rates; see Section 5. 



3.1. Explicit form of the orthonormal PLS basis. Our first result, The- 
orem 3.1, below, gives an explicit account of the constrained optimiza- 
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tion described in Section 2.3. We use the following notation. Given a.\ 
and «2 in the class C{X) of all square-integrable functions on X, write 
f x f x ct\a.<iK to denote J x f x ai(s)ot2(t)K(s,t) ds dt. For any x £ C(Z), de- 
fine ||x|| 2 = j x f x xxK. (Some implementations of PLS, e.g., the one in Ap- 
pendix A. 2, take ||x|| 2 = f x x 2 , but this affects only the scale, not the main 
properties of the functions tpj.) 

Theorem 3.1. If f x E(X 2 ) < oo, then the function ip p that maximizes f 
at (2.9), given ip±, . . . , ip p -i and subject to (2.10), is determined by 



(3.1) ^p = co 



p~ i / \ ^ p— i 



' J k=l 



where, for 1 < k <p — 1, the constants Ck are obtained by solving the linear 
system of p — 1 equations 



(3.2) J J1>rf p K = Q, j = l,...,p-l, 

and where cq is defined uniquely, up to a sign change, by the property 

(3.3) lhM = i- 



One of the interesting implications of the theorem is that for each j, the 
jth basis function determined by PLS can be expressed as a linear combi- 
nation of j explicitly defined functions. More precisely, the theorem implies 
that ipi = d\K(b), where, by (3.3) with p = 1, d\ = (6)|| - , and more gen- 
erally, the following properties follow from the representation (3.1); the first 
property implies the second: 

(a) For each p > 1, and given ipi, . . . ,ifj p _i, the function ip p is the 
linear combination of K(b), . . . , K p (b) for which (2.10) holds, and 
(3.4) is unique up to a sign change, (b) For each p>l, representing 
a function as a linear form in ip\ , . . . , ip p is equivalent to representing 
it as a linear combination of K(b), . . . , K p (b). 

These properties motivate the APLS formulation and underpin the rest of 
the paper. Interestingly, (3.4) continues to hold if equations (3.2) are re- 
placed by j x ijjjijj p = for j = 1, . . . ,p — 1. In particular, although the func- 
tions ip2f-,ipp w ill change in this case, the spaced spanned by ip± , . . . , ip p 
will not alter. 

Result (3.4) is a deterministic functional version of a known result for 
empirical PLS in the multivariate context. More specifically, suppose we 
have n observations of a g-variate (nonfunctional) predictor of a variable Y, 
let X be the n x q matrix of observations on the predictor, and let y be the 
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n x 1 vector containing the observations on Y. Then it has been established 
that the space spanned by the first p empirical PLS components is equal to 
the space generated by X T y,DX T y, . . . , D p_1 X T y, where D = X T X. See, 
for example, Bro and Elden (2009), and compare the empirical algorithm in 
Section 4.1. This is itself a particular case of results that are available more 
generally in Krylov spaces, although again in the multivariate rather than 
functional setting, that is the subject of this paper. 

3.2. Expansions in a nonorthogonal PLS basis. The properties at (3.4) 
give a clear and explicit account of the form taken by the PLS basis functions. 
For example, they show that for each p, the space generated by ipi, . . . , ip p is 
the same as the space generated (i.e., spanned) by K(b), . . . , K p (b). Note that 
the functions K^{b) are explicitly defined, since we have K^{b) = 0^/3% <f>k, 
where cj>k is the kth PCA basis function. 

Next, if we note that a = E(Y) — f x bE(X) and define 71 , . . . , j p to be the 
sequence w\,...,w p that minimizes 

(3.5) t p {w 1 ,...,w p ) = E^ J {X - EX)b ~Y2 w i J ^ X ~ EX ) RJ ( 6 ) j 

[compare (2.3)], then the slope function approximation b p at (2.4) has two 
equivalent expressions, 

(3.6) b p = J2^(b) = J2Pj^ 

3=1 3=1 

where /3i, . . . , f3 p are as defined in Section 2.1 if we take the general ip±, . . . , tp p 
introduced there to be the specific functions given by Theorem 3.1. 
In matrix notation, 



(3.7) j = (7i,...,7 P ) T = H 1 (a 1 ,...,a 
where H = (/ijfc)i<j i fc< p denotes a p x p matrix, 

r. OO 

(3.8) h jk = j^ +1 (b)K k (b) = J> r PC ) 2 ^ 



T 

pj j 



i+fc+1 



r=l 

oc 



(3.9) 



a j= / K{b)K\b) = V(/3 r PC ) 2 ^ +1 = / l0i . 



Here we have used the fact that, for p fixed, the matrix H is nonsingular 
because, for finite p, the equivalence of the expansion in the orthogonal 
basis tpi, . . . , ip p and in the basis K(b), . . . ,K' p (b) implies that the sequence 
71, . . . ,7 P that minimizes (3.5) is unique. See also our discussion on Hankel 
matrices in Section 5.3. 
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The pth order approximation g p (x) to g(x) = E(Y \ X = x), resulting from 
the pth order approximation of b by either of the identities at (3.6), is given 
equivalently by the second formula at (2.4) or by the expression 

(3.10) g p {x) = a+ I ' b p x = E{Y) + Ylj [ (x - EX)K j {b). 

Jx Jx 

We denote by APLS the formulation of PLS based on the sequence K(b), . . . , 
KP(b). 

For the approximation at (3.6) to converge to 6, that function should be 
expressible as a linear form in K(b),K 2 (b), . . . , 

oo 

(3.11) b = J2^J Kj (b), 

i=i 

where the Wj's are constants, and the series converges in L 2 . The next 
theorem gives conditions under which, for a general b in C(I), there exist 
wi,W2,... such that (3.11) holds. 

Theorem 3.2. If J X E(X 2 ) < oo, and the eigenvalues of K are all nonze- 
ro, then each b € C(I) can be written as at (3.11), where the series converges 
in L 2 . 

Under the side condition $ X E(X 2 ) < oo the assumption in Theorem 3.2 
that all eigenvalues of K be nonzero is both necessary and sufficient for (3.11) 
to hold for all 6 € C(X). However, if some eigenvalues 0j, corresponding to re- 
spective eigenvectors <pj , vanish, then the respective values of J X (X — EX)<pj 
vanish with probability 1, and so those indices make zero contribution to 
J X (X — EX)b = Y2j fz(X ~ EX)(f)j ■ Jj- b(pj . Therefore we can delete the com- 
ponents of b = <f>j J x b(j)j that correspond to indices j for which 0j = 0, 
without affecting the value of f x bX; and it is only through the latter in- 
tegral that b influences prediction. Therefore the theorem can be stated in 
a form which asserts that even if some of the eigenvalues of K vanish, the 
representation at (3.11) is sufficiently accurate for all purposes of prediction 
based on (2.1). The only reason we have not taken this course is to make 
our arguments relatively simple and transparent. 

Note that the Wj's in (3.11) are not determined uniquely. In particu- 
lar, (3.11) implies that K(b) = ^2jWjK^ +1 (b), and so the following ex- 
pansion, among many others, is an alternative to (3.11): b = X^iWi + 
Wj + i)K^ +l {b). This lack of uniqueness does not violate the equivalence 
noted in (3.4) (b), since that property is asserted only for a finite sequence 
ipij ... } ipp. However, it makes it impossible to treat usefully the relation- 
ship between the infinite expansion of a function b in terms of the se- 
quence K(b),K 2 (b), . . . , and its infinite expansion in terms of the PCA basis, 
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<j)\,4>2i ■ ■ ■ , introduced in Section 2.2. Nevertheless we can discuss the pth or- 
der PLS projection b p = ^'j^'JjK^ (b) of b onto the finite-dimensional space 
spanned by K(b), . . . , K p (b), for an arbitrary but fixed p > 1. 

To this end, recall that /3f c , fi^ C > • • • denote the Fourier coefficients of b 
with respect to the PC A basis 0i,<^2> Then, 

p p oo oo / p \ 

(3.12) 6 P >>,A-';M = E^E^ PC ^ = E^ PC l>0i 0*. 

i=i i=i fe=i fc=i \j=i / 

and the last series expresses b p in terms of the components of the PCA basis. 
4. Empirical implementation of APLS. 

4.1. Algorithm for empirical APLS. A standard algorithm for empirical 
implementation of PLS based on the sequence ipi,. ..,ip p is given in Ap- 
pendix A. 2. In this section we describe a simple empirical algorithm for 
implementing APLS based on the nonorthogonal sequence K(b), . . . , K p (b). 
As we shall see, this algorithm will permit simple derivation of theoretical 
properties of PLS. In Section 4.2 we shall deduce two algorithms that are 
numerically more stable. 

To estimate K(b), . . . , K p (b), first note that we can estimate K(b) by 

1 n i n 

K(b) = -J2 XrX ccnt = - ^(X t - X)(Yi - Y), 

i=l i=l 

where Xf ent = Xi — X and Y^ cnt = Yi — Y. Then, given an estimator K^(b) 
of Ki(b), we can estimate K^ +1 (b)(t) by K^ +1 {b){t) = f x Ri (b)(s)K(s,t) ds, 
where K is the conventional estimator of the covariance function, K(s,t) = 



n 



_1 £r=i{^( s ) - X(s)}{Xi(t) -X{t)}. Having calculated K^{b) for 1 < 



j < p we take 71 , • . • , 7p to minimize 

n ( p \^ 

(4.1) u p ( Wl ,...,w p ) = ±j2 hr nt -X>; / xr'^ib)] 

1=1 1 j=i Ji j 

with respect to wi, . . . , w p [compare (3.5)]. In matrix notation, 

(4.2) 7= (% . . . ,%) T = H-\a u . . . , a p ) T , 
where H = (hjk)i<j,k<p denotes a p x p matrix, 

(4.3) h jk = [ [ k{s,t)k j {b){s)K k {b){t)dsdt= [ K j+1 (b)K k (b), 



TJX 



(4.4) aj = J K(b)K ] (b). 
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Finally we construct an estimator of g based on (3.10), 

v r 

(4.5) g p (x)=Y + J2li (x-X)K^b). 

Remark 1. Formula (3.8) demonstrates that the theoretical version H 
of H is a symmetric matrix. Our estimator H does not necessarily enjoy 
that property, but an alternative estimator of hjk can be defined to sat- 
isfy it. More precisely we can take hjk = f x K^ +k (b)K '(b) , which produces 
a symmetric estimator H = (hjk) of H. We could use H in place of H, but 
computing hjk requires K to be iterated j + k times, whereas hjk needs 
iteration at most max(j + l,k) times. Therefore we prefer the version H. 

4.2. Stabilized algorithm for empirical APLS. The algorithm described 
in Section 4.1 would provide a good solution if we were able to work in 
exact arithmetic, but it can be unstable in finite precision arithmetic. This 
is because, due to the nonunicity of the expression for b in terms of the 
infinite series K(b),K 2 (b), . . . , as p increases the linear system of equations 
given by the empirical version of (3.5) [see (4.1)] becomes closer to singular. 
Therefore, in finite precision arithmetic, as p increases it becomes more 
difficult to numerically identify one or more of the valid expressions arising 
from a large number of terms in the sequence K(b),K 2 (b), .... 

There exist a number of numerical methods for overcoming this numerical 
difficulty. A simple approach is to transform the linear system of equations 
by Gram-Schmidt orthogonalization; see Section 7.7 of Lange (1999). There, 

the columns of the nxp matrix with (i,j)ih element equal to J x Xf ent K ] '(b) 
are transformed into p orthonormal vectors u\ , . . . , u p by the modified Gram- 
Schimdt algorithm (a numerically stabilized version of Gram-Schmidt al- 
gorithm; see Appendix A. 3). Instead of using 7 in (4.2), the sequence that 
minimizes (4.1) can then be computed by solving, with respect to w±, . . . , w p , 
the equivalent equation H(wi, . . . , w p ) T = U T (Y 1 cent , . . . , Y^ ent ) T , where U is 
a matrix with columns u±, . . . , u p , and R is an upper pxp triangular matrix. 
Let 7* = (7I, . . . ,7*) T be the solution of this equation. We can estimate g 

hyT P (*) = Y + EU 7* f T (x - X)Ki(b). 

Alternatively, having constructed K 3 (b) for 1 < j < p as in Section 4.1, we 
can also transform them into an orthonormal sequence ip± , . . . , tp p satisfying 
the standard PLS constraints, J x J x ipjipkK = for j ^ k [(compare (2.10)], 
using, for example, the modified Gram-Schmidt algorithm. Then we can 
calculate an empirical version /3i,...,j3 p of . . . ,/3 p , the latter defined in 
Section 2.1 (taking there the ij)j J s to be the empirical PLS basis functions), 
by finding the sequence v±, . . . ,v p that minimizes (2.5). Finally, we can esti- 
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mate g by 

* /■ 

(4.6) g p {x) = Y + J2Pj (x-Xtyj. 

3=1 h 

In exact arithmetic, g^ and 7* would be equal to, respectively, g p and 7 
denned in (4.5) and (4.2). Likewise, g p , would be equal to g p . In practice, 
these approximations differ because we can only work in finite precision 
arithmetic, and the algorithms leading to g^ and g p are much more nu- 
merically stable than the one leading to g p . In general, for prediction we 
found the algorithm leading to g p to be preferable. However, the algorithm 
of Section 4.1 is important for developing intuition and assembling theo- 
retical arguments. On the theoretical side, the simple, explicit formulae in 
Section 4.1 permit us to establish consistency and derive rates of conver- 
gence. Of course, the equivalence between g p , g p and g^ implies that, in 
order to derive the theoretical properties of g p and gt, it suffices to derive 
them for g p (all three have the same theoretical properties). On the intuitive 
side we note that the explicit formulation of the quantities involved in our 
empirical algorithms for APLS gives a much clearer account of what partial 
least-squares does, than the standard empirical iterative PLS algorithm in 
Appendix A. 2. 

5. Asymptotic properties of empirical APLS. 

5.1. Introduction. To our knowledge, the only existing theoretical results 
for functional PLS are those of Preda and Saporta (2005a), who state gen- 
eralizations to the functional data context of some results of Hoskuldsson 
(1988). Although they are of interest, the theoretical arguments there are 
iterative and not explicit, and consistency of the PLS approximation is men- 
tioned without a proof and without regularity conditions or convergence 
rates. This is because those results are based on the iterative empirical 
approximation of PLS, and the inexplicit form of the algorithm (see Ap- 
pendix A. 2) apparently makes it very difficult to derive explicit theoretical 
results. 

Our alternative formulation, APLS, of the functional partial least-squares 
problem permits us to derive many properties. As already explained in Sec- 
tion 4.2, the theoretical properties of the empirical approximations g* and g p 
in Section 4.2 are identical to those of g p in Section 4.1. 

5.2. Main results. Define \i = E{ X), a function, and observe that we can 
write: 

(5.1) K = K + n~ 1/2 C + n' 1 ?], K(b) = K{b) + n" 1/2 £o + n -1 ^, 
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where £ and rj are functions of two variables, £o an d % are functions of 
a single variable, each equals Op(l). More specifically, 

n 

£(s, t) = ^ J> - £0{Xi( a ) - n(s)}{Xi(t) - n(t)}, 

1=1 



1 

&>(*) = -T72 I> - *0 W*) " K*)HYi ~ E(Yi)}, 

ft i . 

1 = 1 

n(s, t) = -n{X(s) - v(s)}{X(t) - fM(t)}, 

m {t) = -n{X(t)-v(t)}(Y-EY). 

For any square-integrable function L of two variables, define |||L||| 2 = fx fx^ 2 
and put R x = |||A"||| +n~ 1/2 |||£||| + n _1 |h|||, R2 = \U\\\ + Nil- Define too 



(5.2) 0(i) = J K'(b)(s)as,t)d8 
and 

i-2 

(5.3) = + £jr fc (G-k-i). 

fc=0 



(5.4) 



li/iHiJS-'lliwII + ^E^"*" 1 ^ 6 )^!^*" 1 !^!!) 
fe=i 



+ J R 2 iii£inX;i?r fc - i x:iii^iiiiiK 



Theorem 5.1 below requires no assumptions beyond the model at (2.1), 
and the condition that 



(5.5) Jb 2 < 00, £||X|| 4 <oo, E(e 2 )<oo. 

[Recall that e, satisfying E(e j X) = 0, is the error in the model at (2.1).] 
Note that, under (5.5), it follows from (5.3) and (5.4) that ||£j|| + ||?7j|| = 

Op{n~ 1 / 2 ). Theorem 5.1 shows that the empirical approximations K^(b) to 
the basis functions used by APLS, converge in probability to their theoretical 
values Ki(b) at a rate n -1 / 2 . 

Theorem 5.1. If (5.5) holds then, for each j > 1, 

(5.6) K j (b) = K j (b) + n' 1 ' 2 ^ + n" 1 ^-, 
where £j is defined at (5.3) and rjj satisfies (5.4). 
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The next theorem shows that the matrix entries hjk, defined at (4.3), con- 
verge in probability to their theoretical counterparts hjk, at (3.8), at a rate 
n — 1/2 This theorem will be used to establish consistency of the empirical 
coefficients 7, used in the empirical APLS expansion at (4.5). Note that, 
since HI A ((1 2 = J2j0], the condit ion < 81 < I A' I imposed in Theorem 5.2 
is equivalent to asserting that at least two values of 6j are nonzero. The 
condition ||| A||| < 1 can be ensured by simply changing the scale on which X 
is measured, and so is imposed without loss of generality. 

Theorem 5.2. Assume (5.5), that 61,62,. ■ ■ is the eigenvalue sequence 
in the representation (2.6), ordered such that (2.7) holds, and that < 8\ < 
III A" I < 1. Then \\rjj\\ = Op(|||A|p) uniformly in 1 < j < Cn 1 ! 2 , and 

hjk = h jk + n" 1 ' 2 [ {tj +1 K k {b) + K* + \b)S k } 
Jx 

(5-7) 

+ O p (n^ 1 ^ 1 |||A||| fc +n- 2 |||A|P' +fc ), 
uniformly in 1 < j <k < Cn 1 / 2 as n — > 00, for each C > 0. 



Our next result, Theorem 5.3, applies Theorems 5.1 and 5.2 to derive 
a stochastic expansion for the difference between the theoretical approxi- 
mant g p (x), at (3.10), and its estimator g p (x), at (4.5). Let Aijk = 
j x {^j + \K k {b) + A J+1 (6)^}, denoting the coefficient of n" 1 / 2 in the expan- 
sion (5.7), and put Ai = (Ayfc), a p x p matrix, and 5 = (A101, ■ . ■ , Aio p ) T , 
a p-vector. Also, let A = X(p) be the smallest eigenvalue of the p x p matrix 
H = (hjk), introduced in Section 3.2. 

Theorem 5.3. Under the conditions of Theorem 5.2, and if each 0j > 0, 
(5.8) ||7 - {7 + n-V 2 H-\6 - A l7 )}|| = O p (n~ l \^), 



(x) - g p (x) 

= Y — EY + n 



(5.9) 



1/2 E 



{H- 1 (6-A 1 ^)} j / (x-EX)K\b) 



+ lj J i( x ~ EX )tj ~ nl/2 (X ~ EX)K j (b)} 
+ O p (n- l \~ 1 \\ 1 \\+n- l \- 3 ), 

uniformly in functions x and integers p for which \\x\\ < C , 1 < p < Cn 1 / 2 
and n l / 2 \ — > 00, where C > is fixed but arbitrary. 
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Note that, by (3.8), \h jk \ < e{ +k+1 \\b\\ 2 , and therefore \\Hv\\ < d\\v\\ for 
all p- vectors v, where the constant C\ does not depend on p. (Here we 
have used the condition 9\ < 1, which we introduced in Theorem 5.2 and 
also imposed in Theorem 5.3.) Hence A < C\ for all p. Note too that since, 
for finite p, H is nonsingular (see Section 3.2), then its smallest eigenvalue 
A = X(p) is positive. On the other hand, when p = oo the sequence 71, 72, ... , 
that minimizes (3.5) is not unique (see Section 3.2), and so we can have A — > 
as p — > 00. The condition re 1 / 2 A —¥ 00 imposed in Theorem 5.3 reflects this 
property, and essentially puts an upper bound to the speed at which p can 
tend to infinity as a function of n. 

5.3. Implications of the main theorems and additional results. 

5.3.1. Consistency and rates of convergence. Let Xq have the same dis- 
tribution as Xi, . . . ,X n but be independent of those random functions, and 
let II 

' llpred denote the predictive L2 norm, conditional on X\, . . . ,X n : if W 
is a random variable, then ||PF||p rec [ = {E(W 2 \ X\,. . . , An)} 1 ' 2 . For exam- 
ple, taking W = ^ p (Xq) — g(Xo) we obtain a measure of the accuracy with 
which g p (Xo) predicts g(Xo). We shall show in Section 7.6 that if p = p(n) is 
chosen to diverge no faster than n 1 / 2 , and sufficiently slowly to ensure that 

(5.10) n~ 1 / 2 A~ 1 ||7|| +n~ l \- 3 ->0 

as n — > 00, then 

\\9p(Xo) -g(X ) ||pred 

(5.11) 

= Opin-^X-^l + ||7||) + n-'X- 3 + ipOn, . . . , 7p ) 1/2 }, 

where t p is as at (3.5). It follows from Theorem 3.2 that if all of the eigenval- 
ues 9j are nonzero, then ^(71, . . . ,7 P ) — >• as p — > 00. (As remarked in the 
paragraph immediately below that theorem, the condition that each 6j is 
nonzero can be dropped.) Therefore, (5.10) implies that (j p (Xo) is consistent 
forg(X ). 

Additionally, Theorems 5.1-5.3 make it clear that, provided p does not 
diverge too quickly as a function of n, the quantities sup J<p ||iP(&) — iP(6)||, 

supi<,- k<p\hjk — hjk\ and supj< p |7j — jj\ [see (5.12) below] converge in prob- 
ability to zero as n diverges. 

5.3.2. Results in supremum metrics. For our expansions of the func- 
tion K 3 (b) at (5.6), and of the vector 7 at (5.8), our bounds on remainder 
terms are given in L2 metrics. In either case they can be extended to the 
supremum metric. For example, (5.8) itself implies that 

(5.12) sup \j j - 7j -n~ 1 / 2 {H~ 1 (5-A 11 )} j \ = O p (n- 1 X~ 3 ). 

i<i<p 
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Theorem 5.4 below states a version of (5.6) in the metric. It makes 
use of the following regularity conditions: 



for both D; = 1 and D; = Ys 



(5.13) 



sup 

t&x 



1 n 

^^{XityDi-EXiMDi} 



i=l 



o p (i), 



(5.14) 



r i n 

SU P / "T72 5> " E ){ x *(s) ~ EXi( 8 )}{Xi(t) - EXi(t)} 
tax Jx n 7 f—' 

2 = 1 



O p (l). 



Conditions (5.13) and (5.14) will be discussed in Appendix A.l. 

Theorem 5.4. If (5.5), (5.13) and (5.14) hold, then sup teX \£j(t)\ = 
O p (l) for each j , and 

sup \&(b)(t) - {KHb)(t) + n-V2^(t)}| = O p (n~ l ). 
tex 

5.3.3. Interpreting stochastic expansions. The coefficients of n" 1 / 2 in the 
expansions of K^(b)(t) — K^(b)(t), hj k — hj k , Jj — 7j and J p (x) — 7 p (x) 
in (5.6), (5.7), (5.8) [see also (5.12)] and (5.9), respectively, are each equal 
to n" 1 multiplied by a sum of n independent and identically distributed 
random variables with zero mean, plus a term that equals O p (n~ 1 ). In these 
cases, for fixed (j,t), (j,k), j and (p,x), respectively, the independent ran- 
dom variables do not depend on n. Therefore, their variances can be com- 
puted easily. 

For example, in the case of hj k — hj k , using (5.7) and the definitions of £ 
and £o> we have, under the conditions of Theorem 5.2 and for each fixed j 
and k, 



(5.15) 



h jk = h jk + n ls ^Z ijk + O p (n 1 ), 



i=l 



where the independent and identically distributed random variables Zij k , 
Z njk are given by 

Zijk = (1 - E) 



x / K k (b)(u) 



{Yi-E(Yi)} 

x f K j (u,t){Xi(t)- n(t)}dt 
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3-1 , 

+ Y1 {Xi(t)-fi(t)}K%u)dt 



K j+1 (b)(u) 



{Yi-E(Yl)} 

x J K k " 1 (u,t){X i (t) - (jt(t)}dt 

fe-2 . 

+ J2 {Xi(t)-n(t)}K l (t,u)dt 

o—c\ JX 



£=0 



K 



k-£- 



du. 



The distribution of Z^ does not depend on n, and, under the assumption of 
finite fourth moment of X and finite second moment of e [see (5.5)], Z^ has 

7 ffc> " 

that n l l 2 {hjk — hjk) 



finite variance o? u , say. Hence, for each fixed j and k it follows from (5.15) 



is asymptotically normal N(0,(r| fe ) 



5.3.4. Hankel matrix properties. In Section 3.2 we demonstrated that 
a.j = f x^m(dx), where m is the measure that places mass (f3f c ) 2 9 r at the 
point 6 r for r > 1; m has no mass anywhere else. Therefore the p x p matrix 
H = (a.j + k) is a Hankel matrix for which the associated nonnegative mea- 
sure, m, is discrete and compactly supported. The latter property implies 
that m is completely determined by its moments ay, and hence that the 
Hankel matrix is "determinate;" see, for example, Berg and Szwarc (2011). 
In such cases the smallest eigenvalue of H can converge to zero arbitrarily 
fast as p diverges [Berg and Szwarc (2011), Theorem 2.5], although more is 
known about the case where m is a continuous than that of a discrete mea- 
sure, and it is particularly challenging to develop general theory describing 
properties of H~ l in the context of our measures m. [See Lascoux (1990) 
and Hou, Lascoux and Mu (2005) for access to the literature on inverses 
of Hankel matrices and their determinants.] Nevertheless, as we noted in 
Section 3.2, H is generally nonsingular for all p. 



6. Numerical illustrations. In this section we illustrate, numerically, in 
a few examples, the fact that the algorithms in Section 4.2 and Appendix A. 2 
do indeed solve the same problem. We also illustrate the main difference 
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between the PLS basis and the PCA basis, namely that PLS can capture 
the interaction between X and Y using a smaller number of terms than 
PCA. 

In our first illustration, we take the X^s from a real data study, and 
generate the Yj's according to the linear model at (2.1). By choosing the 
population in this way, we can represent, in simulations, the vagaries of real 
data, but we can still compare the performance of our methodology with 
the "truth." We take the Xj curves from a benchmark Phoneme dataset, 
which can be downloaded from www-stat.stanford.edu/ElemStatLearn. In 
these data, Xi(t) represents log-periodograms constructed from recordings 
of different phonemes. The periodograms are available at 256 equispaced fre- 
quencies t, which for simplicity we denote by t = 1, 2, . . . , 256. Hence, in this 
example, 1= [1,256]. See Hastie, Tibshirani and Friedman (2009) for more 
information about this dataset. We used the N = 1717 data curves X{(t) 
that correspond to the phonemes "aa" as in "dark" and "ao" as in "wa- 
ter." 

We computed the first J = 20 empirical PCA basis functions 4>\ (t) , . . . , 
4>2o(t), and considered four different curves 6, which we constructed by tak- 
ing b(t) = Ylj=i a j < l ) j(t) f° r f° ur different sequences of ay's: (i) aj = (— 1) J • 
l{j < 5}; (ii) a, = (-1)* • 1{6 < j < 10}; (iii) a, = (-1)* • 1{11 < j < 15}; 
(iv) a,j = (— l) 3 ■ 1{16 <j< 20}. These four models were chosen to illustrate 
clearly the advantages of the PLS basis over the PCA basis. Example (i) 
illustrates a situation particularly favourable to PCA, where the interaction 
between X and Y can be represented by the first few PCA basis functions. 
There we do not expect that PCA will need many more terms than PLS to 
achieve a small prediction error. On going from example (i) to example (iv), 
the function b is represented by five consecutively indexed PCA basis func- 
tions in each case, but with their indices successively larger. However, as we 
shall see below, in those cases too, PLS manages to construct a basis that 
captures the interaction between X and Y using only the first few terms. 

In the four cases, for i = 1, . . . ,N we generated the l^'s by taking Y^ = 
J x Xib + Si, where Si ~N(0,ct 2 ), and where 5<r 2 was equal to the empirical 
variance of the fj-bXi's calculated from the N observations. Then, in each 
case, we randomly split these N observations in two parts: a training sample 
of size n, and a test sample of size N — n. We did this 200 times for each of 
n = 30, n = 50 and n = 100, so for each setting we generated 200 test and 
training samples. 

For each set of test and training samples generated in this way, we con- 
structed our predictor using only the test sample, and then we applied it 
to predict /j&Aj for each Xi in the associated training sample. In other 

words, we constructed X, Y and b from the training sample only, where b 
was the empirical version of b p calculated either via the first p terms of the 
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PLS basis (calculated from the algorithm in Appendix A. 2 or the second 
algorithm of Section 4.2), or via the first p terms of the PCA basis, for each 
of p = 1, . . . , 10. Then, for each observation Xj in the test sample, we calcu- 
lated the predictor Yi = Y + f x b(Xi — X) of J x bXi. Note that this predictor 
includes the estimator Y — f- bX, L of the intercept because, although our 
data were generated from a model with no intercept, in practice we are not 
supposed to know this. 

To quantify the quality of prediction, we calculated the prediction error 
PE = (N — 77,)" 1 X]£i n (^ — fi^Xi) 2 in each case, for each method, and for 
each test sample. In Figure 1 we show boxplots of these prediction errors 
calculated in each case from the 200 test samples. Note that here the two PLS 
algorithms gave exactly the same estimators, and so the boxplots only show 
the results for the standard PLS algorithm and for the PCA method. These 
boxplots show that as the information about the interaction between X 
and Y moves further away in the sequence of <^j's [i.e., going from case (i) 
to case (iv)], PLS can capture the interaction using fewer terms than PCA. 
For example, in case (i), PLS took p = 3 components to reach the prediction 
error that PCA reached with p = 5, but in case (iv), the prediction error was 
already very small for PLS with p = 10, and was still very large for PCA 
with p = 10. We also calculated the integrated squared error ISE = j x (b — b) 2 
for each method and test sample. In Figure 2 we show boxplots of these ISEs 
calculated from the 200 test samples, for models (i), (iii) and (iv). We can 
see that the PLS estimator of b needs fewer components than the PCA 
estimator to reach small ISE values. 

In our second example we took the orange juice data which comprise 
N = 216 observations (Xi(t), Yj), i = 1, . . . , N, where each is the saccha- 
rose content of a sample of orange juice, and Xi is a curve representing 
the first derivative of near-infrared spectra of the juice at 700 equispaced 
points t. We take t = 1, . . . , 700 (hence X = [1, 700]). The data can be found 
at www.ucl.ac.be/mlg/index.php?page=DataBases. As with our simulated 
data above, we split the observations randomly into a training sample of 
size n and a test sample of size N — n, for each of n = 30, 50 and 100. We did 
this 200 times for each n. Then in each case we calculated our predictor, as 
above, from the training sample, and applied it for predicting f x Xib for the 
corresponding test sample. Here we did not know the true model, so we calcu- 
lated an estimator of the prediction error as PE = (N — n)~ l ^2j^[ n (Yi — Y{) 2 , 
for each (A",, Y^) in the test sample. In this way we obtained 200 values of PE 
for each n. Figure 3 shows, for each n, boxplots of these 200 PE's, for p = 1 
to 8. As above, the two PLS algorithms (the algorithm in Appendix A. 2 and 
the second algorithm of Section 4.2) gave exactly the same results, except 
for p = 8 where the numerical roundings of both methods differed somewhat. 
Therefore we show the boxplots for both algorithms. In this example too we 
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Fig. 1. Boxplots of the prediction error using the first p PLS components (first group 
of 10 boxes) or the first p PC A components (last group of 10 boxes), calculated from 200 
samples of sizes n = 30 (first column) or n = 100 (second column) generated from the 
phoneme data. The curve b is that in cases (i), (ii), (iii) and (iv), in, respectively, rows 1, 
2, 3 and 4- From left to right, each group of 10 boxplots addresses the settings indexed by 
p = 1 to p = 10. 
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Fig. 2. Boxplots of the ISE ofb using the first p PLS components (first group of 10 boxes) 
or the first p PCA components (last group of 10 boxes), calculated from 200 samples of 
sizes n = 30 (first row) or n= 100 (second row) generated from the phoneme data. The 
curve b is that in cases (i), (iii) and (iv), in, respectively, columns 1, 2 and 3. From left 
to right, each group of 10 boxplots addresses the settings indexed by p= 1 to p= 10. 



can see that the two PLS algorithms clearly solve the same problem, and 
that PLS needs fewer terms (i.e., p is smaller) to capture the same interac- 
tions as PCA. This can be advantageous when computing time is an issue, 
for example when a linear prediction is associated with a complex nonpara- 
metric procedure. For example, in Ferraty and Vieu (2006), the linear fit is 
used in combination with nonparametric estimators of E(Y\X). 




Fig. 3. Boxplots of the estimated prediction error using the first p PLS components 
calculated by the algorithm of Appendix A. 2 (first group of 8 boxes) or the second algorithm 
of Section 1^.2 (second group of 8 boxes, denoted by PLS2), or the first p PCA components 
(last group of 8 boxplots). Each box was calculated from 200 samples of sizes n = 30 (first 
column), ?i = 50 (second column) or n = 100 (third column) drawn randomly from the 
orange data. From left to right, each group of 8 boxplots is for p = 1 to p = 8. 
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7. Technical arguments. 

7.1. Proof of Theorem 3.1. Defining a 2 = var(e) we see that the right- 
hand side of (2.9) can be expressed as 

P-l 

cov<!( / bX)-Y,( I b^)[ I X^), I X^ p 



x 



J=1 _ , z 
The partial derivative of the right-hand side here, with respect to ip p , equals 

(7.1) 

The equation in c\~ at (3.2) is the result of adjoining Lagrange multipliers 
on the right-hand side so as to accommodate the first p — 1 constraints 
in (2.10). The factor cq on the right-hand side of (3.1) accommodates the 
last constraint in (2.10). 

7.2. Proof of Theorem 3.2. Recall that C(I) is the space of all square- 
integrable functions on X, and suppose b = J2jfiJ C( l ) j € C(X). Write C P (X) 
for the p-dimensional subspace of C(X) generated by the PC A basis func- 
tions (j> p , and let K p denote the transformation that takes b p = 
Ei<y< p /3J%- eC p (X) to V, ; , p 0^j%. Now, 

{e 1 i-K p )---(e p i-K p )b p = o 

for all b p G C P (X). Therefore, 

(7.2) a b p + ai Kp{bp) + ■■■ + a p KP(b p ) = 

for all b p G C P (X), where ao,...,a p are constants and a$ = 6\ - ■ ■ 6 p . In par- 
ticular, do is nonzero, and so (7.2) implies that, for constants ci, . . . ,c p , 

(7.3) bp = c l K{bp) + --- + CpKP(bp). 

Let P p denote the projection operator that takes b = ^ • (3j C 4>j G C(Z) 
to -P p (6) = b p G C P (X). Since P p and commute, then K^{b p ) = K^P p {b) = 
P p Ki\b). Therefore (7.3) implies that b p = P p { Cl K(b) + ■■■ + c p K p (b)}, or 
equivalently, 

(7.4) P p [b - { Cl K{b) + ■■■ + c p K p (b)}} = 0. 

In view of (7.4), if we let X>(X) denote the vector space generated by K(b), 
K 2 (b),..., and if we define P p {V(X)} = {P p {z):z£ V{X)}, then P p (b) G 
P p {V(X)} for all p. Now, P p {V{X)} C V(X), which is closed under limit 
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operations in L2. Therefore, the limit as p 00 of P p (b), that is 6, must be 
in V(I). 

7.3. Proof of Theorem 5.1. Assume it can be proved that (5.6) holds, 
with £j and rjj satisfying (5.3) and (5.4), for a particular j > 1; in view 
of (5.1), (5.6) is valid for j = 1. Then, 



K j+1 (b)(t) = j K j {b)(s)K{s,t)ds 

= j {K j (b) + n- l / 2 Zj + n-^jj^iK + n~ l / 2 Z + n~ l r])(s,t) ds 
= Ki +l {b)(t) + n~ l l 2 J{K^b){s)^s,t)+^(s)K(s,t)}ds 
+ rT x j{Ki (b)(s)r)(s, t) + Vj (s)K(s, t) + & t)}ds 
+ n~ 3/2 Jfa {s) V (s, t) + ri 3 (s)e(s, t)} ds 
(7.5) +n~ 2 / r )j (s)r]{s,t)ds. 



Therefore, taking to be given by the coefficient of n 1 ^ 2 in (7.5), and 
recalling the definition of Q at (5.2), we have 



e i+1 (t) = jf {^(6)( a )e(a,t)+^(a)iiC(a > t)}ds 

= K^)(t) + Cj(t) = ^ 2 fe-i) (t) + K(Cj-i)(t) + (jit), 

which, on iteration, gives (5.3). 

Finally we derive the bound at (5.4) on the remainder, again arguing by 
induction; assuming that (5.4) holds for j we establish it for j + 1. Tak- 
ing r/j+i to equal n times the sum of the terms in n~ 1 , n~ 3 / 2 and n~ 2 
in (7.5), we deduce that 



Vj+i(t) = J^(b)(s) V (s,t) + Vj (s)K(s,t) + ^( S )as,t)}ds 

+ n- l/2 {£j(s)r)(s,t) + Vj (s)((s,t)}ds + n~ l Vj (s) V (s,t) ds. 



Therefore, 

ll^+iH < ||i^(&)|||N|| + H\\\\\K\W + IIOIIICI +n" 1/2 (llOHW + H 

+ n_1 Wvj II III 7 ? Ill 

< (11^(6)11 + 11011)^2 + 11^11^ 
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< (11*' (6)11 + Uj\\)R2 + {(\\K j -Hb)\\ + He.-ill)^ + U-i\\Ri}Ri 

= {(\\K j m + ii^id + (H^- 1 ^)!! + ||^-i||)i?i}i?2 + H-i\\R\ 

< ^p(\\K k (b)\\ + ii^lD^r'l^ + \\m\\R{, 

where the last identity follows on iteration. Observe too that, by (5.1), r/i = 
770 ■ Therefore, 



(7.6) ll^+ill^^NII + ^E^II^WII + ll^ll)' 

k=l 

Note too that, by (5.2), ||Cj|| < ||if J, (6)|||^l, and so, by (5.3), 

fc-2 

ii&ii^iii^^iiiiiCoii + iiieiiiEin^iiiii^"^ 1 ^)!!- 

Hence, by (7.6), 

M^-^!!! — ^iii^oll < ^2 ^3^i^ fe <T ||^ fc C^)ll Hl^^- 1 HI ||«eo II 



k=l 



k~2 



+ iiieiii^iii^iini^- 1 



j 



R a ^Bt k (\\K k (b)\\ + lK k - 1 lUo\ 



k=X 



j k—2 

(7.7) +R 2 \u\\\^Rt h ^\\\ Ke \\\\\ Kk ~ e ~ 1 

k=l i=0 

Result (5.4) for j + 1 follows from (7.7). 

7.4. Proof of Theorem 5.2. Representation (5.6) implies that 

(7.8) h jk = h jk + n~ 1 ' 2 J^ 3+l K k (b) + K^\b)Ck} + n~ l R jk , 
where 



\Rjk\ 



< 



J^ +1 (b)K k (b), h jk = J^ +1 (b)K k (b), 
J{tj+l£k + K> +1 (b) m + K k {b)r, 3+1 ) 
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Vj+lVk 



< U j+ i\\U k \\ + pp +1 (&)||||ifcll + 11^(6)1111^11 

(7.9) +n- 1 / 2 (||7 7i+1 ||||e fe || + ||0+ilHI%||) + n- 1 ||7 ?i+1 

Next we bound \R jk \. Observe that \\K k f = Y,jOf = OiOf), ||A fc (&)|| 2 = 
Ej^fdzHj) 2 = O(0f) and I f/ HI + + l^oll + ||6II = Op(l) as n -> oo. 
Hence, by (5.2), < ||^ j (&)||||£| = O p (9{), uniformly in j > 1, and there- 
fore by (5.3), 

(7.10) UA=o P U + Y,e k 1 e{- k A = o p (j9{), 

V fc=0 / 

uniformly in j > 1. Note too that 

R{ = {(\IK\\\ + n-^mW + n^lHim = O p (|^||P), 

uniformly in 1 < j < Cn 1 ^ 2 , for any C > 0. More simply, R2 = O p (l). Hence, 
by (5.4), 

/ j'-i i-i \ 

= o P \\\K\r + e iw - *- 1 *? + E m' - * -1 **? 

V k=l k=l / 



(7.11) =O p (\lK\r), 

uniformly in 1 < j < Cn 1 ^ 2 . (Here we have used the property < 9\ < \\\K\\\ < 
1.) Combining (7.9)-(7.11) we find that 

(7.12) R jk = O p {jk9{ +k + 9{\\K\t + 0\\\Kf 

+ n- l l\lKf9\ + \\Kl k 9{) + n- l \\K f +k } 

(7.13) =O p (^ 1 ||| J Sr||| fc +n^|||Al|P +fc ), 

uniformly in 1 < j < k < Cn}l 2 . Theorem 5.2 follows from (7.8) and (7.13). 

7.5. Proof of Theorem 5.3. From (3.10), (3.11) and (4.5) we deduce that 
g p (x) - g p (x) - (Y - EY) 

= E{% j^ x ~ X ^'^ " Ti J^ x ~ EX)K*{b)} 
= E[(%"7 j ) / (x-EX)K*(b) 



1 
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+ 7j J (x - EX){K\b) - K\b)} 



- 7j J^X-EX)K^b) 

J {X-EJC)Ki{b) 
-% ({X-EX){Ki{b)-K\b)} 



(7.14) 



Combining (5.6), (7.10) and the bound \\r]j\\ = Op(|||-K"||p), valid uniformly 
in 1 < j < Cn 1 / 2 for each C > and given in Theorem 5.2, we deduce that 



(7.15) 



\K j (b) - K j {b)\\ < n" 1 / 2 !!^)! +n- 1 \\r lj \\ 

= O p {n~ 1 / 2 j6{ + n- 1 \\\K\\\ j ) 



uniformly in 1 < j < Cn 1 / 2 for each C > 0. 

More simply, \\X — EX\\ = Oy^n" 1 / 2 ). Combining this bound with (7.10), 
(7.14) and the properties ||x|| < C and ||K J (6)|| = O(0{), we deduce that 

S„(x)- 9p (x) -(Y-EY) 



E 



(%-7,0 l (x-EX)K>(b) 



+ jj / (x - EX){K 3 (b) - K 1 (b)} - / {X - EX)Ri (b) 



(7.16) +OJ n- l / 2 J2(\% -7,1 +n- 1 ' 2 \i J \)(j9{ +n- 1 l 2 \\K\V) , 



uniformly in 1 < p < Cn 1 / 2 and ||x|| < C, for each C > 0. Using (5.6) and 
the bound ||?7j|| = O p (|||ii'|p), we deduce from (7.16) that 

g p {x)-g p (x)-{Y -EY) 



E 



(7j-7i) / (x-EX)K>Q>) 



+ 1j J {n~ 1/2 (x - EX)ij -(X- EX)K>(b)} 
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+ O p 



n 



v 

-1/2 



E{i%-^i(^+ n " 1/2 iii^iii r 



(7-17) +^- 1/2 (|%| + |7il)II^IP'} 

uniformly in 1 < p < Cn 1 / 2 and ||x|| < C, for each C > 0. 

Given any pxp matrix M, define its norm by ||M|| = sup„. \\ v \\=i \\Mv\\. 
Writing A for a particular p x p matrix, and recalling that A = X(p) denotes 
the smallest eigenvalue of H, we have ||Ai? _1 || < ||A||/A. Therefore, if H = 
(hjk) is the p x p matrix obtained when hjk is defined as at (4.3), and we 

put A = H — H , then, provided that ||A||/A < p where p E (0, 1) is fixed, we 
have 

(7.18) H^ 1 = (I + H^A)- 1 ^ 1 = [I- £T X A + OpjdlAH/A) 2 }]^ 1 . 

Here the matrix M represented by O p {(||A||/A) 2 } is interpreted as having 
the property ||Mi;|| < (1 — /3) _1 (||A||/A) 2 ||^|| for all p- vectors v (provided that 
II A || /A < p), where on this occasion \\Mv || and \\v\\ denote vector norms of 
the indicated quantities, and ||A|| is the matrix norm of A. 

We know from (5.7) that hjk = hjk + n -1 ! 2 A\jk + n~ 1 A2jk, where 

A ljk = f {i ]+1 K k {b) + K^\b)i k }, 

(7.19) JX . 

|A 2j ,|=O p (^ 1 |||A'||| fc +n- 1 |||^||P +fc ), 

the latter property holding uniformly in 1 < j < k < Cn 1 / 2 . Note too that, 
by (7.10), ll^ll = O p (j6{), uniformly in j > 1, and that \\K^(b)\\ = 0(9{), so 
|Aijfc| = Op{max(j, k)di +k }. Therefore, if we define Ajk = hjk — hjk then, 
since 6\ < \\\K\\\ < 1, we have n^^Kp^jfc = O p (l), uniformly in p < 
Cn 1 / 2 . Hence, ||A|| = O p (n~ 1//2 ), uniformly inp < Cn 1 / 2 , where A = (Ajk) is 
apxp matrix. Therefore, if p is chosen to diverge so slowly that p = 0(n 1 / 2 ) 
and A = \{p) satisfies n 1 / 2 \ — > oo then, by (7.18), 

(7.20) H' 1 = {I- H~ X A + Opin^X- 2 )}!!- 1 , 

uniformly in p < Cn 1 / 2 . [Here O p (n _1 A -2 ) denotes apxp matrix, M say, 
for which ||JVfu||/||«|| = O v {n~ l \~ 2 ) uniformly in nonzero p- vectors v.] Note 
too that, if we define Ai to be the p x p matrix with (j, fe)th element Agjk, 
for I = 1, 2, then, in view of the second formula at (7.19), Yl J2j k<p ^2jk = 
O p (l), and so HA2H = O p (l) uniformly in p < Cn 1 / 2 . Therefore (7.20) and 
the property A = H — H = n~ 1 / 2 A\ + n~ 1 A2 imply that 

(7.21) H- 1 = {I- n~ 1 / 2 H- l A l + Op^A" 2 )}^ 1 . 
[Here we used the fact that A < h\^\ = O(l).] 
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Recalling the definitions of hjk, ctj and ay at (4.3), (4.4) and (3.9), we 
deduce that Sj = hoj. Noting that result (5.7) can be extended to hoj, we 



have that Si 



X..J — ctj + n 1 / 2 Aioj + n "i\20j, wnere i\\Qj ana ^20j are given 
by (7.19). Note too that, by (4.2) and (3.7), % = (^" 1 a) i and jj = (H^ajj, 
where a = (a i, . . . , a P ) T and a = (Si, . . . , 3? P ) T . Since K^(b) = O{0\) uni- 
formly in j > 1, ||7/,-|| = O p ( I ||p ) uniformly in 1 < j < Cn 1 / 2 (see Theo- 
rem 5.2) and ||£j|| = O p (j8^) uniformly in j > 1 [see (7.10)], then, by (5.6), 
\\Ki(b)\\ =O p (6{ + n- 1 / 2 j6{+n- 1 \\\K\p) uniformly in \<j<Cn l l 2 . Using 
formula (4.4) for aj, and the fact that < 9\ < \\\K\\\ < 1, we deduce that 



+ n 1 Aooi, where Aio,- and A? 



(7.22) 



|S||< j^||£(6)|| 2 ||^ 
lj=i 



1/2 



o P (i), 



uniformly in 1 < p < Cn 1 / 2 . 

Therefore, defining 5 = (Aiqi, • ■ • , Aio p ) T , we have, by (7.21), 



7 : 



(7.23) 



i? _1 (a + n~ 1/2 5) - n^^H^AiH^a + C^n^A -3 ) 
7 + n^H^id - A l7 ) + Opin^X- 3 ), 



uniformly in 1 <p < Cn 1 / 2 , where the two vectors denoted by O p (n 1 X 3 ) 
have the property that their norms equal Op(n -1 A~ 3 ) uniformly in 1 < p < 
Cn 1 ' 2 . 

Next we combine (7.17) and (7.23), obtaining 



g p ( X ) - g p ( X ) - (Y - EY) 
V 



j"=i 



n 



-1/2 



{H~ l {5-A 



.7)}jjf(*- 



EX)K j (b) 



f 7j ^{n 



-1/2 



(x - £7X)^- - (X - £Ar)if J '(&)} 



(7.24) 



+ n- 1 ^{\j j - lj \(j9{ + n- 1 / 2 \lK\r) 

+ n -l/2 ( | % | + | 7 .|)|| W - } 



uniformly in 1 < p < Cn 1 / 2 and < C for each C > 0. Here we have used 
the fact that, if V = (Vi, . . . , V P ) T is the vector denoted by O p (n~ 1 A~ 3 ) on 
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the far right-hand side of (7.23), then 



E 



Vj / (x-EX)K j (b) 



< 



E 



(x-EX)K j (b) 



uniformly in 1 < p < Cn 1 / 2 and ||x|| < C, since Ylj>i ll-^ 3 '(^)l| 2 < °°- 

Note too that, since \\K^(b)\\ = O(0{) and ||^|| = O p (jd{), uniformly in 
1 <j < Cn 1 / 2 , then by (7.19), |A yfc | = O p (jk9{ +k ), uniformly in 1 < j, k < 
Cn 1 / 2 , and therefore, 

ha 1 ii 2 <x;e a v*= o i»( 1 )' 

j=l k=l 

||5|| 2 = ^A? 0j =O p (l), 
i=i 

uniformly in 1 <p< Cn 1 / 2 . Hence, by (7.23) and (7.22), 

||7 - 7ll = Op{fi- 1 / 2 A- 1 (||«|| + || A! || || 7 ||) + n-'X- 3 } 
= O p {n- 1 / 2 A" 1 (l + || 7 ||)+n- 1 A- 3 }. 

Therefore, 

E{ 1% - 7i I + ^ V2 II I^IIP ! )+ n- l ' 2 ( |% I + |7il)lW} 
i=i 

= P (ll7-7ll+^ 1/2 ||7ll) 
(7.25) = O^n- 1 / 2 A- X (l + || 7 ||) + n^A" 3 }. 

Result (5.8) is a consequence of (7.23), and (5.9) follows from (7.24) and (7.25). 

7.6. Proof of (5.11). To derive (5.11), note that minor modifications of 
the argument used to derive (5.9) can be employed to show that, under the 
conditions of Theorem 5.3, 

\\g P {X ) - g p (X )\\ pied 



Y-EY 

3=1 



{H-\5 - Ai 7 )}j / {X - EX)Ki{b) 
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7i / {(^o - EX)Zj - n l '\X - EX)Ki{b)} 



prcd 

(7.26) +O p (ra- 1 A" 1 ||7|| +n~ 1 A" 3 ), 

uniformly in p satisfying 1 < p < Cn 1 / 2 , for each C > 0. The predictive norm 
on the right-hand side of (7.26) can be shown to equal O p {n _1 / 2 A" 1 (l + 
H7II)}, and so if (5.10) holds, then 

(7.27) ||? P (X )- 5p (X )|| prcd = O p {n^ 1 /2 A -i (1 + || 7 || ) + n -i A -3 }> 
Since \\g p (X ) - g(X )\\ pvcd = ^(71, . . . ,7 P ) 1/2 then (7.27) implies (5.11). 

APPENDIX 

A.l. Conditions (5.13) and (5.14). Here we give examples where (5.13) 
and (5.14) hold. Assume that E(X) = 0. Then the Karhunen-Loeve expan- 
sion of Xi, founded on the principal component basis introduced in Sec- 

1 /2 

tion 2.2, is given by Xi = ^2j0j £,ij<Pji where the random variables for 
j > 1, are uncorrelated and have zero mean and unit variance. For simplic- 
ity we suppose that they have identical distributions with bounded fourth 
moments, that E(e ) < 00, and that the eigenvalues 0j and eigenvectors 

satisfy the condition X^?=i su Ptex l^jWI < 00 ■ Then, 



E 



sup 

tGX 



n 



(A.l) 



^-^{Xi(t)A-^(t)A} 

oo ^ n 

<Y,o) /2 {su V \Ht)\}E ^(1 - 

J=l 1=1 

00 

< (E& ■ EDtf^e 1 / 2 ^ \<f>j(t)\} < °°, 



E 



sup 

*ex 



1 " 

2(1 " E){Ms) ~ EXi(s)}{Xi(t) - EX(t)} 



i=l 



E 



sup 

tex 



(A.2) <E{&) 



J=l fc = l I 2=1 

i 



2# j /2 {sup|<^)|} 
j=i *ex 



2n 
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where we have used the properties 



E 



n y 

— ^(i-^-A <^{(eiiA) 2 }<(^i-^) 1/a , 

Properties (5.13) and (5.14) follow from (A.l) and (A. 2), respectively. 

A. 2. Conventional implementation via the PLS basis. Inference is based 
on a dataset X = {(X\, Yi), . . . , (X n , Y n )} of independent data pairs dis- 
tributed as (X,Y). We first introduce the centred data x\^ = Xi — X and 

Y^ =Y{ — Y, for 1 < i < n. Here and below, a superscript in square brackets 
denotes the number, or index, of a step in our algorithm. The algorithm goes 
as follows. For j = 1, . . . ,p: 

(1) Estimate ipj by the empirical covariance of X\ j] and y} J] : ^ = 

EILi^x^VllEr^^ 1 ^ 1 !!. 

(2) Fit the models y} j] = Jjlf^+ef 1 and X [ ?\t) = 5 5 (t) JjX^^j + 
r]f\t) by least-squares; that is, take 



i=l 



'[.)] 



i=l 



(3) Calculate Af +1] (i) = xf{t) - Sj(t)J x X^j and y} 3+1] = Y} j] - 

After completion of steps (1) to (3) for all j, define M = {Mj,k)i<j,k<p 
by M" 1 = (J x Sj^ k )i<j,k<p- Then b p (t) = Y%,k=iPkMj )k i>j(t) and g p (x) = 
Y + j x b p (x-X). 



A. 3. Modified Gram-Schmidt algorithm. This algorithm turns a set of 
linearly independent functions vi,...,v p into a set of orthogonal functions 
u±, . . . ,u p , where orthogonality is defined with respect to a scalar product 
(•, •). For example, for the second algorithm in Section 4.2, the scalar product 
between two functions /i and f2 is defined by (/i, $2) = J x f x fi(s)f2(t)K(s, 
t) ds dt. The modified Gram-Schmidt algorithm is described in Lange (1999), 
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Section 7.7. It works as follows: 



for j = l,...,p 




for i= - 1 



end loop i 




(u)',Ui)Ui 




end loop j. 
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